Computer  Science  Department 


TECHNICAL  REPORT 


THE  NYU  ULTRACOMPUTER  — 
A  GENERAL-PURPOSE  PARALLEL  PROCESSOR 


Allan  Gottlieb,  Ralph  Grishman, 
Clyde  P.  Kruskal,  Kevin  P.  McAuliffe 
Larry  Rudolph,  and  Marc  Snir 

REPORT  NO.  040 

JULY  1981 


NEW  YORK  UNIVERSITY 


Department  of  Computer  Science 
Courant  Institute  of  Mathematical  Sciences 

251  MERCER  STREET,  NEW  YORK,  NY.  10012 


Ultracomputer  Note  #32 


This  work  was  supported  in  part  by  the  Applied  Mathematical 
Sciences  Program  of  the  U.S.  Department  of  Energy  under 
Contract  No.  DE-AC02-76ER0  3077,  and  in  part  by  the  National 
Science  Foundation  under  Grant  No.  NSF-MCS76-00116. 



The NYU Ultracomputer -- A General-Purpose Parallel Processor

Allan  Gottlieb,  Ralph  Grishman,  Clyde  P.  Kruskal, 
Kevin  P.  McAuliffe,  Larry  Rudolph,  and  Marc  Snir 

Courant  Institute  of  Mathematical  Sciences,  NYU 

251  Mercer  St.,  New  York,  NY  10012 

(Extended  Abstract) 


Abstract 

We present the design for the NYU ultracomputer, a
general-purpose  MIMD  parallel  processor  composed  of  thousands  of 
autonomous  processing  elements.  This  machine  uses  an  enhanced 
omega-network  to  approximate  the  ideal  behavior  of  Schwartz's 
paracomputer  model  of  computation  and  to  implement  efficiently 
the  important  replace-add  synchronization  primitive.  The  novelty 
of  the  design  lies  in  the  enhanced  network,  in  particular  in  the 
constituent  switches  and  interfaces.  We  also  present  the  results 
of  analytic  and  simulation  studies  of  the  network  and  include  a 
sample  of  our  efforts  to  implement  parallel  variants  of  important 
scientific codes.

Within a few years advanced VLSI technology will produce a
fast  single-chip  processor  including  high-speed  floating-point 
arithmetic.  This  leads  one  to  contemplate  the  level  of  computing 
power  that  would  be  attained  if  thousands  of  such  processors 
cooperated  effectively  on  the  solution  of  large-scale 
computational  problems. 

The  NYU  "ultracomputer"  group  has  been  studying  how  such 
ensembles  can  be  constructed  for  effective  use  and  has  produced  a 
tentative  design  that  includes  some  novel  hardware  and  software 
components.  The  design  may  be  broadly  classified  as  a  general 
purpose  MIMD  machine  using  an  omega-network  to  access  a  central 
shared memory. (For related designs see [1], [11], [12], [14],
and [15].)

The  major  thrust  of  this  report  is  to  outline  in  some  detail 
the  proposed  hardware  and  the  analytic  and  simulation  results 
upon  which  parts  of  the  design  are  based.  We  also  sketch  some 
system  software  issues  and  a  sample  of  our  ongoing  efforts  to 
produce  parallel  versions  of  important  scientific  codes  (but  the 
reader  should  see  [3]  and  [4]  respectively  for  a  more  detailed 
treatment  of  these  last  two  topics).  Section  2  of  the  present 
report  reviews  the  idealized  computation  model  upon  which  our 
design  is  based;  Section  3  presents  the  machine  design;  Section 
4 analyzes network performance; Section 5 highlights a parallel
scientific code; and Section 6 summarizes our results.

2.0  MACHINE  MODEL

In  this  section  we  review  the  paracomputer  model,  upon  which 
our  machine  design  is  based,  and  the  replace-add  operation,  which 
we use for interprocessor synchronization. Although
paracomputers  are  not  physically  realizable,  we  will  see  in 
Section  3  that  close  approximations  can  be  built. 

2.1  Paracomputers 


An  ideal  parallel  processor,  dubbed  a  "paracomputer"  by 
Schwartz  [10],  consists  of  identical  PEs  sharing  a  common  memory. 
The  individual  PEs  may  also  have  their  own  "private"  memories; 
the  memory  common  to  all  processors  is  called  "shared",  and 
variables  stored  there  are  called  "shared  variables".  Multiple 
processing  elements  (PEs)  can  simultaneously  read  shared  cells  in 
the  same  cycle.  Moreover,  simultaneous  writes  (including  the 
replace-add  operation  described  below)  are  likewise  accomplished 
in  a  single  cycle  and  a  memory  cell  to  which  such  writes  are 
directed  will  contain  some  one  of  the  quantities  written  into  it. 
This  requirement  on  simultaneous  memory  updates  illustrates  the 
(paracomputer)  serialization  principle:  The  effect  of 
simultaneous  actions  by  the  PEs  is  as  if  the  actions  occurred  in 
some  (unspecified)  serial  order.  Note  that  simultaneous  memory 
updates  are  in  fact  accomplished  in  one  cycle.  The  serialization 
principle  speaks  only  of  the  effect  of  simultaneous  actions  and 
not of their implementation.

We stress again that paracomputers must be regarded as
idealized computational models since physical fan-in limitations
prevent their realization. In the next section we review the
technique whereby Lawrie's [9] omega-network may be used to
construct a parallel processor closely approximating the ideal
paracomputer.

2.2   The  Replace-add  Operation 


We now introduce a simple yet very effective interprocessor
synchronization  operation,  called  replace-add,  which  permits 
highly  concurrent  execution  of  operating  system  primitives  and 
application  programs.  The  format  of  this  operation  is 
RepAdd(V,e),  where  V  is  an  integer  variable  and  e  is  an  integer 
expression.  We  assume  that  this  indivisible  operation  yields  the 
sum  S=V+e  as  its  value  and  replaces  the  contents  of  storage 
location  V  by  this  sum.  Moreover,  RepAdd  must  satisfy  the 
serialization  principle  stated  above:  If  V  is  a  shared  variable 
and  many  replace-add  operations  simultaneously  address  V,  the 
effect  of  these  operations  is  exactly  what  it  would  be  if  they 
occurred  in  some  (unspecified)  serial  order,  i.e.  V  is  modified 
by  the  appropriate  total  increment  and  each  operation  yields  the 
intermediate  value  of  V  corresponding  to  its  position  in  this 
order.  The  following  example  illustrates  the  semantics  of 
replace-add:  If  V  is  a  shared  variable,  if  PEi  executes 
    ANSi <-- RepAdd(V,ei),
if PEj simultaneously executes
    ANSj <-- RepAdd(V,ej),
and if V is not simultaneously updated by yet another PEk, then
either
    ANSi <-- V+ei
    ANSj <-- V+ei+ej
or
    ANSi <-- V+ei+ej
    ANSj <-- V+ej
and,  in  either  case,  the  value  of  V  becomes  V+ei+ej.  The  first 
possibility  corresponds  to  the  serialized  order  in  which  first 
PEi  executes  its  replace-add  and  then  PEj  executes  its 
replace-add;  the  second  possibility  corresponds  to  the  opposite 
serialization. 
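
The semantics just described can be sketched in software. The following Python fragment is an illustration only: the class name SharedCell, the helper run_concurrently, and the use of a lock are assumptions introduced here. The lock merely stands in for the indivisibility of the operation; it serializes the requests, which is precisely what the hardware design in Section 3 is meant to avoid.

```python
import threading


class SharedCell:
    """One shared integer variable with an indivisible replace-add (hypothetical model)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware indivisibility

    def rep_add(self, e):
        """Add e to the cell and return the new sum as one indivisible step."""
        with self._lock:
            self._value += e
            return self._value

    def load(self):
        with self._lock:
            return self._value


def run_concurrently(cell, increments):
    """Issue one rep_add per increment from its own thread; return the answers."""
    answers = [None] * len(increments)

    def worker(i, e):
        answers[i] = cell.rep_add(e)

    threads = [threading.Thread(target=worker, args=(i, e))
               for i, e in enumerate(increments)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return answers


v = SharedCell(10)                          # V = 10
ans_i, ans_j = run_concurrently(v, [3, 5])  # ei = 3, ej = 5
# Either (ans_i, ans_j) == (13, 18) or (ans_i, ans_j) == (18, 15);
# in both cases V ends up as 18.
```

Which of the two outcomes occurs depends on thread scheduling, mirroring the "unspecified serial order" of the serialization principle.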

It  is  also  possible  to  have  loads,  stores,  and  replace-adds 
all  concurrently  directed  at  the  same  memory  location.  Once 
again  the  serialization  principle  demands  that  the  effect  is  as 
though  these  operations  occurred  in  some  serial  order.  In 
particular,  simultaneous  loads  from  the  same  memory  location  may 
not  yield  identical  results  since  a  simultaneous  store  or 
replace-add  may  intervene. 


The  next  section  presents  a  hardware  design  that  realizes 
replace-add  in  essentially  the  same  execution  time  as  a  load  or 
store  to  shared  memory  and  that  realizes  simultaneous 
replace-adds updating the same variable in a particularly
efficient manner.

If  the  replace-add  operation  is  available,  we  can  perform 
many  important  algorithms  in  a  completely  parallel  manner,  i.e. 
without  using  any  critical  (and  hence  necessarily  serial)  code 
sections.  For  example  [3]  presents  a  completely  parallel 
solution  to  the  readers-writers  problem*  and  a  highly  concurrent 
queue  management  technique  that  can  be  used  to  implement  a 
totally  decentralized  operating  system  scheduler.  We  are  unaware 
of  any  completely  parallel  solutions  to  these  problems  using  the 
test-and-set  operation.  Kruskal  [6]  gives  efficient  replace-add 
based  algorithms  for  solving  several  other  important  problems  and 
in  Section  5  we  summarize  work  of  Korn  [5],  which  uses  the 
replace-add  to  parallelize  scientific  codes. 


*  Since  writers  are  inherently  serial,  the  solution  cannot 
strictly  speaking  be  considered  completely  parallel.  However, 
the  only  critical  section  used  is  required  by  the  problem 
specification.  In  particular,  during  periods  when  no  writers  are 
active, no serial code is executed.

3.0  MACHINE  DESIGN

In this section we present the design of the NYU
ultracomputer, a machine that appears to the user as a
paracomputer; of course the single cycle shared memory access
can  only  be  approximated.  Specifically  an  enhanced  version  of 
Lawrie's  [9]  omega-network  is  used  to  interconnect  N  =  2**D 
autonomous  PEs  to  a  central  shared  memory  composed  of  N  memory 
modules  (MMs).  Thus,  the  direct  single  cycle  access  to  shared 
memory  characteristic  of  paracomputers  is  replaced  by  an  indirect 
access  via  a  multicycle  interconnection  network.  Each  PE  is 
attached  to  the  network  via  a  Processor  Network  Interface  (PNI) 
and  each  MM  is  attached  via  a  Memory  Network  Interface  (MNI).  We 
do  not  provide  private,  separately  addressable  memory  local  to 
the  PE;  rather,  private  data  is  allocated  in  central  memory  and 
a  large  cache  is  associated  with  each  PE  (see  Section  3.4). 
Figure 1 gives a block diagram of the machine.

Figure 1.   Block Diagram


After  reviewing  routing  in  an  omega-network,  we  show  that  an 
analogous  network  composed  of  enhanced  switches  provides 
efficient  support  for  concurrent  replace-add  operations.  We  then 
present  a  detailed  design  for  the  switches  and  conclude  this 
section  by  discussing  the  network  interfaces.  Both  the  PEs  and 
MMs  are  relatively  standard  components;  the  novelty  of  the 
design lies in the network and in particular in the constituent
switches and interfaces.


3.1   Routing  In  An  Omega-network 

The manner in which an omega-network can be used to
implement  memory  loads  and  stores  is  well-known  (see  [9])  and  is 
based  on  the  existence  of  a  (unique)  path  connecting  each  PE-MM 
pair.  To  describe  the  routing  algorithm  we  use  the  notation  in 
Figure  2:  both  the  PEs  and  the  MMs  are  numbered  using  D-bit 
identifiers  whose  values  range  from  0  to  N-1;  the  binary 
representation of each identifier x is denoted xD...x1; upper
ports on switches are numbered 0 and lower ports 1; messages
from PEs to MMs traverse the switches from left to right; and
returning messages traverse the switches from right to left. A
message is transmitted from PE(pD...p1) to MM(mD...m1) by using
output port mj when leaving the stage j switch. Similarly, to
travel from MM(mD...m1) to PE(pD...p1) a message uses output port
pj at a stage j switch.

Figure 2.   Omega-network (N=8)


The  routing  algorithm  just  presented  generalizes  immediately 
to a D-stage network composed of k-input-k-output switches
(instead of the 2x2 switches used above) connecting k**D PEs to
k**D MMs: Simply number the ports 0 to k-1 and write the
identifiers in base k. Although the remainder of this section
deals  exclusively  with  2x2  switches,  all  the  results  generalize 
to  larger  switches,  which  are  considered  in  Section  4. 

We  propose  using  a  packet  switching  network.  Thus,  it  may 
appear  that  both  the  destination  and  return  addresses  must  be 
transmitted  with  each  message.  However,  we  need  transmit  only 
one  D  bit  address,  an  amalgam  of  the  origin  and  destination: 
When  a  message  first  enters  the  network,  its  origin  is  determined 
by  the  input  port,  so  only  the  destination  address  is  needed. 
Switches  at  the  jth  stage  route  messages  based  on  memory  address 
bit mj and then replace this bit with the PE number bit pj, which
equals  the  number  of  the  input  port  on  which  the  message  arrived. 
Thus,  when  the  message  reaches  its  destination,  the  return 
address  is  available. 
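
The routing rule and the single amalgamated address field can be traced with a short program. The sketch below is illustrative and rests on assumptions not fixed by the text: it uses the textbook omega-network wiring (a perfect shuffle of the lines before each stage of 2x2 switches), examines the destination bits most significant first, and uses the hypothetical function name route. The stage and bit numbering of Figure 2 may differ.

```python
def route(pe, mm, d):
    """Trace a request from PE `pe` to MM `mm` through d stages of 2x2 switches.

    Returns (final_line, address_field, ports): the line on which the message
    leaves the network, the contents of its single d-bit address field on
    arrival, and the output port taken at each stage.
    """
    mask = (1 << d) - 1
    line = pe     # the wire the message currently occupies
    field = mm    # starts as the destination; origin bits are filled in en route
    ports = []
    for stage in range(d):
        bit = d - 1 - stage                               # bit examined at this stage
        line = ((line << 1) | (line >> (d - 1))) & mask   # perfect shuffle
        in_port = line & 1                                # port the message arrived on
        out_port = (field >> bit) & 1                     # route on the destination bit
        field = (field & ~(1 << bit)) | (in_port << bit)  # overwrite with the origin bit
        line = (line & ~1) | out_port                     # leave on the chosen output
        ports.append(out_port)
    return line, field, ports


# PE 5 (binary 101) sends a request to MM 3 (binary 011) in an N = 8 network.
final_line, return_address, ports = route(5, 3, 3)
```

In this model the message leaves the network on line 3 (the destination MM), and the address field then holds 5 (the originating PE), so the reply can be routed back using the same field.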


Since  multiple  requests  entering  a  switch  may  require  the 
same  output  port,  we  provide  queues  at  each  of  these  ports 
(detailed  in  Section  3.3).  This  policy  also  permits  an  important 
optimization:  When  concurrent  loads  and  stores  are  directed  at 
the  same  memory  location  and  meet  at  a  switch,  they  can  be 
combined  (and  thereby  satisfied)  without  introducing  any  delay  by 
using  the  following  procedure  (some  of  these  optimizations  appear 
in the CHOPP design [14]):

1.  Load-Load: Forward one of the two (identical) loads and
satisfy each by returning the value obtained from
memory.

2.  Load-Store:  Forward  the  store  and  return  its  value  to 
satisfy  the  load. 

3.  Store-Store:   Forward  either  store  and  ignore  the  other. 

Combining  requests  reduces  communication  traffic  and  thereby 
increases  network  bandwidth.  Since  combined  requests  can 
themselves  be  combined,  any  number  of  concurrent  memory 
references  to  the  same  location  can  be  satisfied  in  the  time 
required  for  one  central  memory  access. 


3.2   Implementing  Replace-add 

By  including  adders  in  the  MNIs,  the  replace-add  operation 
can  be  easily  implemented;  moreover,  by  further  augmenting  the 
network,  we  can  combine  concurrent  replace-adds  as  we  combined 
concurrent  loads  and  stores  above.  When  RepAdd(X,e)  is 
transmitted  through  the  network  to  the  MM  containing  X,  the  value 
of  X  and  the  transmitted  e  are  brought  to  the  MM  adder,  and  the 
sum  is  both  stored  in  X  and  returned  through  the  network  to  the 
requesting  PE.  Since  replace-add  is  our  sole  synchronization 
primitive  (and  is  also  a  key  ingredient  in  many  algorithms), 
concurrent  replace-add  operations  will  often  be  directed  at  the 
same location. It is therefore crucial, in a design supporting
large numbers of processors, not to serialize this activity. As
shown  below,  by  including  a  few  cells  of  memory  and  an  adder  in 
each  switch,  the  network  can  combine  replace-adds  with  the  same 
efficiency  as  it  combines  loads  and  stores. 

When  two  replace-adds  referencing  the  same  shared  variable, 
say  RepAdd(X,e)  and  RepAdd(X,f),  meet  at  a  switch,  the  switch 
forms the sum e+f, transmits the combined request RepAdd(X,e+f),
and stores the value f in its local memory (see Figure 3). When
the value Y is returned to the switch in response to
RepAdd(X,e+f), the switch transmits Y to satisfy the original
request RepAdd(X,f) and transmits Y-f to satisfy the original
request RepAdd(X,e). Assuming that the combined request was not
further combined with yet another request, we would have Y =
X+e+f; thus the values returned by the switch are X+e and X+e+f,
thereby effecting the serialization order "RepAdd(X,e) followed
immediately by RepAdd(X,f)". The memory location X is also
properly  incremented,  becoming  X+e+f.  If  other  replace-add 
operations  updating  X  are  encountered,  the  combined  requests  are 
themselves  combined,  and  the  associativity  of  addition  guarantees 
that  the  procedure  gives  a  result  consistent  with  the 
serialization  principle. 
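
The arithmetic of combining can be checked with a small model. The sketch below is an illustration under stated assumptions: the names MemoryModule, combine, decombine, and tree_rep_add are introduced here, and requests are combined pairwise in a balanced tree, which is only one of the many ways requests might meet in the network.

```python
class MemoryModule:
    """Memory with an adder, as in the MNI (hypothetical software model)."""

    def __init__(self, cells=None):
        self.cells = dict(cells or {})

    def rep_add(self, addr, e):
        self.cells[addr] = self.cells.get(addr, 0) + e
        return self.cells[addr]


def combine(e, f):
    """On the way to memory: forward the increment e+f and remember f."""
    return e + f, f


def decombine(y, saved_f):
    """On the way back: answers for RepAdd(X,e) and RepAdd(X,f), in that order."""
    return y - saved_f, y


def tree_rep_add(memory, addr, increments):
    """Combine all requests pairwise, access memory once, then split the answer."""

    def resolve(incs, y):
        # y is the value returned for the combined request with increment sum(incs)
        if len(incs) == 1:
            return [y]
        mid = len(incs) // 2
        left, right = incs[:mid], incs[mid:]
        _, saved = combine(sum(left), sum(right))
        y_left, y_right = decombine(y, saved)
        return resolve(left, y_left) + resolve(right, y_right)

    y = memory.rep_add(addr, sum(increments))  # the single memory access
    return resolve(list(increments), y)


combined = tree_rep_add(MemoryModule({"X": 10}), "X", [1, 2, 3, 4])

serial_memory = MemoryModule({"X": 10})
serial = [serial_memory.rep_add("X", e) for e in [1, 2, 3, 4]]
```

Here the combined answers are [11, 13, 16, 20], identical to those obtained by issuing the four replace-adds one after another, and memory is accessed only once.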


To  combine  a  replace-add  operation  with  another  reference  to 
the same memory location we proceed as follows:

1.  RepAdd-RepAdd. As described above, a combined request
is transmitted and the result is used to satisfy both
replace-adds.

2.  RepAdd-Load. Treat Load(X) as RepAdd(X,0).

3.  RepAdd(X,e)-Store(X,f). Transmit Store(e+f) and satisfy
the replace-add by returning e+f.


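[Figure 3: RepAdd(X,e) and RepAdd(X,f) enter a switch from the PE
side; the switch stores f and forwards RepAdd(X,e+f) toward
memory. When Y returns, the switch sends Y-f in reply to
RepAdd(X,e) and Y in reply to RepAdd(X,f).]


Figure 3.   Combining Replace-Adds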


3.3   The  Switches 


We now detail an individual network switch, which is
essentially a 2x2 bidirectional routing device transmitting a
message from its input ports to the appropriate output port on
the opposite side. The PE side sends and receives messages to
and from the PEs via input ports, called FromPEi, where i=0,1,
and output ports, called ToPEi. Similarly, the MM side
communicates with the MMs via ports FromMMi and ToMMi. (Note
that in our figures the To and From ports are coalesced into
bidirectional ports.)

As  indicated  above,  we  associate  a  queue  with  each  output 
port.  The  head  entry  is  removed  and  transmitted  when  data  is 
requested  by  a  switch  in  the  adjacent  stage.  To  avoid  queue 
overflow,  a  switch  requests  data  only  when  space  is  available  in 
both  its  output  queues.  (In  unlikely  situations,  a  PE  may 
momentarily  be  prevented  from  issuing  central  memory  requests.) 

To  describe  the  process  whereby  requests  are  combined  in  a 
switch,  we  view  a  request  as  consisting  of  several  components: 
function indicator (i.e. load, store, replace-add), PE number and
MM  number  (these  are  actually  amalgamated),  address  within  the 
specified  MM,  and  data.  For  ease  of  exposition,  we  consider  only 
combining  homogeneous  requests  (i.e.  requests  with  like  function 
fields);  it  is  not  hard  to  extend  the  design  to  permit  combining 
heterogeneous  requests.  When  a  request,  R-new,  arrives  at  a  ToMM 
queue*,  we  perform  an  associative  search  of  the  requests  already 
in  this  queue  using  as  key,  the  function,  MM  number,  and  address 
from  R-new.  Let  R-old  denote  the  message  in  the  ToMM  queue  that 
matches  R-new  (if  no  match  is  found,  R-new  is  simply  added  to  the 
queue).  Then,  to  effect  the  serialization  R-old  followed 
immediately  by  R-new,  the  switch  performs  the  following  actions: 
If the function requested is a load or store, R-old together with
the PE number from R-new is placed into the Wait buffer (to await
the return of R-new from memory) and R-new replaces R-old in the
ToMM queue. The actions for replace-add differ only in the
treatment of the data components: R-new data is placed into the
Wait buffer and the sum of the two data components is placed into
the queue.


*  Although we use the term queue, entries within the middle of
the queue may also be searched and modified.

Before  presenting  the  actions  that  occur  when  a  request 
returns  to  a  switch,  we  make  two  remarks.  First,  we  propose  two 
Wait  buffers,  one  associated  with  each  ToMM  queue,  since 
accessing  these  buffers  may  be  rate  limiting.  Second,  each  entry 
in  the  Wait  buffer  uniquely  identifies  the  message  for  which  it 
is  waiting  since  the  PNI  will  prohibit  a  PE  from  having  two  or 
more  outstanding  references  to  the  same  memory  location. 

After  arriving  at  a  FromMM  port,  a  returning  request,  R-ret, 
is  routed  to  the  appropriate  ToPE  queue  and  is  used  to  search 
associatively the relevant Wait buffer. If a match occurs, the
entry  found,  R-wait,  is  removed  from  the  buffer  and  its  function 
indicator,  PE  and  MM  numbers,  and  address  are  routed  to  the 
appropriate  ToPE  queue.  If  the  request  was  a  load,  the  data 
field  is  taken  from  R-ret;  if  a  replace-add,  the  R-wait  data 
field  is  subtracted  from  the  R-ret  data  field. 

To  summarize  the  necessary  hardware,  we  note  that  in 
addition  to  adders,  registers,  and  routing  logic,  each  switch 
requires two instances of each of the following memory units.
For each unit we have indicated the operations it must support.

1.  ToMM-queue:  Entries  are  inserted  and  deleted  in  a 
queue-like  fashion,  associative  searches  may  be 
performed,  and  matched  entries  may  be  updated. 

2.  ToPE-queue:  Entries  may  be  inserted  and  deleted  in  a 
queue-like  fashion. 

3.  Wait-buffer:  Entries  may  be  inserted  and  associative 
searches  may  be  performed  with  matched  entries  removed. 
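
The interplay of a ToMM queue and its Wait buffer can be modelled in a few lines. This is a sketch, not the switch design itself: the names Request and SwitchSide are hypothetical, only loads and replace-adds are handled, the addr field stands for the (MM number, address) pair, and the repeated search of the Wait buffer for a reply derived from an earlier match is an assumption made here to cover a request that is combined more than once in the same switch.

```python
from collections import namedtuple

Request = namedtuple("Request", "func pe addr data")


class SwitchSide:
    """One ToMM queue with its Wait buffer (hypothetical software model)."""

    def __init__(self):
        self.to_mm = []  # queue of requests; entries may be searched and updated
        self.wait = []   # (PE number of R-new, waiting request) pairs

    def arrive(self, new):
        """Enqueue R-new, combining it with a matching R-old if one is queued."""
        for i, old in enumerate(self.to_mm):
            if (old.func, old.addr) == (new.func, new.addr):
                if new.func == "repadd":
                    # R-new data goes to the Wait buffer, the sum to the queue.
                    self.wait.append((new.pe, old._replace(data=new.data)))
                    self.to_mm[i] = new._replace(data=old.data + new.data)
                else:  # load
                    self.wait.append((new.pe, old))
                    self.to_mm[i] = new
                return
        self.to_mm.append(new)

    def send(self):
        """Remove and return the head of the ToMM queue."""
        return self.to_mm.pop(0)

    def reply(self, ret):
        """Return every reply to be routed toward the PEs for returning request ret."""
        out, pending = [ret], [ret]
        while pending:
            r = pending.pop()
            for j, (key_pe, waiting) in enumerate(self.wait):
                if key_pe == r.pe and (waiting.func, waiting.addr) == (r.func, r.addr):
                    del self.wait[j]
                    data = r.data - waiting.data if r.func == "repadd" else r.data
                    derived = waiting._replace(data=data)
                    out.append(derived)
                    pending.append(derived)  # assumption: search again for chains
                    break
        return out


side = SwitchSide()
for pe, e in [(1, 2), (2, 3), (3, 4)]:
    side.arrive(Request("repadd", pe, "X", e))
forwarded = side.send()                            # one request carrying 2+3+4
replies = side.reply(forwarded._replace(data=19))  # memory held X = 10, so Y = 19
answers = {r.pe: r.data for r in replies}
```

With X initially 10, the three PEs receive 12, 15, and 19, the values of the serial order in which the requests arrived.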


3.4   The  Network  Interfaces 

The  PNI  performs  four  functions:  virtual  to  physical 
address translation, assembly/disassembly of memory requests,
enforcement of the network pipeline policy, and cache management.
The MNI is much simpler, performing only request
assembly/disassembly and the additions necessary to support
replace-adds. Since the MNI operations as well as the first two
PNI  functions  are  straightforward,  we  discuss  only  pipelining 
policy  and  cache  management. 

Before  detailing  these  two  functions,  we  must  consider  the 
effect  of  pipelining  memory  requests  (i.e.  issuing  a  request 
before  the  previous  one  is  acknowledged).  Since  memory  requests 
may  be  enqueued  as  they  traverse  the  network,  pipelining  multiple 
requests from a given PE to distinct MMs results in an arbitrary
arrival order of these requests at the MMs. For shared data this
phenomenon  can  violate  the  serialization  principle  (see  [8]).  A 
simple-minded  way  to  prevent  this  anomaly  is  for  the  PNI  not  to 
pipeline  any  requests  involving  shared  variables  (more 
sophisticated  approaches  would  allow  some  such  pipelining). 
Moreover,  as  indicated  above,  our  current  switch  design  requires 
that  the  PNI  not  pipeline  requests  to  the  same  memory  location. 

Since  accessing  central  memory  involves  traversing  a 
multistage  network,  effective  cache  management  is  very  important. 
To  reduce  network  traffic  a  write-back  update  policy  was  chosen: 
When  a  cache  miss  occurs  and  eviction  is  necessary,  updated  words 
within  the  evicted  block  are  written  to  central  memory.  Since 
the  serialization  principle  prohibits  us  from  caching  shared 
variables,  cache  generated  traffic  can  always  be  pipelined. 

In  addition  to  the  usual  operations,  which  are  invisible  to 
the  PE,  our  cache  provides  two  functions,  flush  and  release,  that 
must  be  specifically  requested  and  can  be  performed  on  a  segment 
level  or  for  the  entire  cache.  By  having  these  functions  under 
software  control,  we  are  able  to  reduce  network  traffic. 


The  flush  facility,  which  enables  the  PE  to  force  a 
write-back  of  cached  values,  is  needed  for  task  switching  since  a 
blocked  task  may  be  rescheduled  on  a  different  PE.  To  illustrate 
another  use  of  flush,  consider  a  variable  V  that  is  declared  in 
task T and is shared with T's subtasks. Prior to spawning these
subtasks, T may treat V as private (and thus eligible to be
cached  and  pipelined)  providing  that  V  is  flushed  immediately 
before  the  subtasks  are  spawned. 


The release facility enables the PE to specify that the
contents of certain virtual addresses are no longer needed; if
these locations are cached, they may be marked available without
being written back. For example, private variables declared
within a begin-end block can be released at block exit. The
release operation reduces network traffic by lowering both the
eviction rate and the quantity of data written back to central
memory during a task switch.

4.0  COMMUNICATION  NETWORK  PERFORMANCE

Since the overall ultracomputer performance is critically
dependent on the communication network and this network is likely
to be the largest component of the completed machine, it is
essential to evaluate network performance carefully so as to
choose  a  favorable  configuration.  We  have  therefore  analyzed  the 
performance  of  a  simplified  network,  and  have  performed  several 
detailed  simulations.  Consider  first  the  following  simplified 
network  model: 

1.  Each  request  consists  of  a  single  packet. 

2.  Each  switch  has  two  inputs  and  two  outputs. 

3.  Packets  are  independently  and  equiprobably  directed  from 
each  network  input  to  each  network  output. 

4.  The queues (and buffers) in the switches have infinite
capacities.

5.  The  time  required  to  delete  a  packet  from  an  output 
queue  of  one  switch  and  insert  it  into  an  output  queue 
of  a  switch  at  the  next  stage  is  the  same  time-invariant 
constant  t  for  each  queue.  Moreover,  packets  enter  the 
network  at  times  which  are  multiples  of  t.  This  leads 
to  a  synchronous  network  with  cycle  time  t,  which  we 
henceforth take to be one.

Let p be the probability that a given PE inserts a packet
into  the  network  during  a  given  network  cycle.  The  following 
hold:  in  the  steady  state  p  is  also  the  probability  that  a 
packet  is  transmitted  during  a  network  cycle  on  any  given  line  of 
the  network;  arrivals  of  packets  on  the  two  input  lines  of  each 
switch  are  independent;  the  average  delay  of  a  packet  at  each 
stage is 1 + .25p/(1-p); the expected length of each queue is
p(1 + .25p/(1-p)); and the number of packets at each queue is
exponentially  distributed  [7].  Note  that,  independent  of  the 
number  of  stages,  the  network  can  support  any  throughput  less 
than  one  packet  per  PE  per  network  cycle. 
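
The two formulas above are easy to evaluate numerically. The following sketch simply transcribes them for the simplified model (2x2 switches, single-packet requests); the function names are introduced here for illustration.

```python
def stage_delay(p):
    """Average delay of a packet at one stage, in network cycles."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1 + 0.25 * p / (1 - p)


def queue_length(p):
    """Expected length of each output queue."""
    return p * stage_delay(p)


light_load = stage_delay(0.04)  # about 1.01 cycles per stage
heavy_load = stage_delay(0.5)   # 1.25 cycles per stage
```

Even at p = 0.5 the queueing term adds only a quarter of a cycle per stage; the delay grows without bound only as p approaches one.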

What  values  of  p  do  we  expect  to  encounter  in  practical 
applications?  We  assume  that  the  network  cycle  will  be  at  most 
half the average PE instruction execution time and thus p <= c/2i,
where  i  is  the  number  of  PE  instructions  executed  and  c  is  the 
number  of  central  memory  requests.  Table  1  shows  the  values  of  p 
obtained  by  instrumenting  several  parallel  scientific  codes  under 
the  pessimistic  assumptions  that  p=c/2i  and  no  shared  data  is 
cached  and  the  optimistic  assumption  that  all  references  to  code 
and private data are satisfied by the cache. The codes studied
are:

1.  A parallel version of part of a NASA weather code
(solving  a  two  dimensional  PDE),  with  16  PEs. 

2.  The  same  program,  with  48  PEs. 

3.  The  TRED2  code  described  in  Section  5. 

4.  A multigrid Poisson PDE solver, with 16 PEs.


| problem | instructions | private mem | shared mem |  p   |
|         |              | references  | references |      |
|    1    |       207184 |       26855 |      16680 | .040 |
|    2    |       276768 |       30715 |      22941 | .041 |
|    3    |       927664 |      184018 |      43348 | .023 |
|    4    |      5109088 |      900303 |     323146 | .032 |


Table  1.   Probability  of  Request  Insertion. 
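
The p column of Table 1 can be reproduced from the other columns. The sketch below assumes, following the text, that p = c/2i and that c counts only shared memory references (private references being satisfied by the cache); the names TABLE_1 and insertion_probability are introduced here.

```python
# problem number -> (instructions i, shared memory references c), from Table 1
TABLE_1 = {
    1: (207184, 16680),
    2: (276768, 22941),
    3: (927664, 43348),
    4: (5109088, 323146),
}


def insertion_probability(instructions, shared_refs):
    """p = c / 2i, the pessimistic estimate used in the text."""
    return shared_refs / (2 * instructions)


p_values = {problem: round(insertion_probability(i, c), 3)
            for problem, (i, c) in TABLE_1.items()}
```

Rounded to three places, the computed values are .040, .041, .023, and .032, matching the table.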

Let  us  now  examine  critically  the  first  five  assumptions 
given  at  the  beginning  of  this  section. 

Bandwidth  restrictions  at  the  switches  may  require  that 
messages  be  sent  in  multiple  packets.  If  each  message  is  sent  in 
m  packets,  then  the  maximal  throughput  of  the  network  is  1/m 
messages  per  PE  per  network  cycle  and  the  average  delay  of  a 
message at each stage is 1 + .25pm**2/(1-mp).


As  mentioned  in  Section  3.1,  one  can  construct 
omega-networks using kxk switches, for any k >= 2; however,
practical  considerations  would  probably  limit  k  to  powers  of  two. 
A network composed of 4x4 switches has one quarter as many
switches and one half as many stages as a network composed of 2x2
switches. Assuming that propagation delays dominate network
processing delays, the added complexity of a 4x4 switch does not
significantly increase the network cycle time. The average delay
at a stage in a network built of 4x4 switches is
1 + .375pm**2/(1-mp), where m is the number of packets per
message.  However,  if  the  switch  bandwidth  b  (the  number  of  bits 
the  switch  can  accept  at  each  cycle)  is  kept  constant,  twice  as 
many  packets  per  message  have  to  be  used  for  a  4x4  switch  as  for 
a  2x2  switch,  halving  the  network  bandwidth. 


For  a  network  of  4096  inputs  and  outputs  and  various  values 
of b and p, Figure 4 contrasts the average transit time obtained
by  using  2x2  and  4x4  switches.  We  estimate,  based  on  pin 
limitations  and  on  address  and  word  sizes,  that  we  can  achieve  a 
switch  bandwidth  of  at  least  2/3  (a  2x2  switch  with  3  packets  per 
message,  or  a  4x4  switch  with  6  packets  per  message).  For  these 
values  of  b,  and  all  values  of  p  given  in  Table  1,  Figure  4  shows 
that  the  transit  time  is  never  significantly  higher  in  a  network 
composed  of  4x4  switches.  Since  the  4x4  network  requires  much 
less  hardware  than  the  2x2  network,  the  former  is  preferable. 
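
The comparison can be reproduced approximately from the per-stage formulas. The sketch below makes a simplifying assumption that may not match the model behind Figure 4: one-way transit time is taken to be the number of stages times the average per-stage delay. The names COEFF, stages, stage_delay, and transit_time are introduced here.

```python
# Per-stage queueing coefficients quoted in the text for 2x2 and 4x4 switches.
COEFF = {2: 0.25, 4: 0.375}


def stages(n, k):
    """Number of stages in an omega-network of n inputs built from k x k switches."""
    d, size = 0, 1
    while size < n:
        size *= k
        d += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return d


def stage_delay(p, m, k):
    """Average delay of an m-packet message at one stage, in network cycles."""
    if m * p >= 1:
        raise ValueError("offered load must stay below 1/m messages per cycle")
    return 1 + COEFF[k] * p * m ** 2 / (1 - m * p)


def transit_time(p, m, k, n=4096):
    """Assumed one-way transit time: stages times per-stage delay."""
    return stages(n, k) * stage_delay(p, m, k)


# Switch bandwidth 2/3: 3 packets per message for 2x2, 6 for 4x4.
t_2x2 = transit_time(0.04, m=3, k=2)  # 12 stages
t_4x4 = transit_time(0.04, m=6, k=4)  # 6 stages
```

Under these assumptions, at p = .04 the 2x2 network takes about 13.2 cycles and the 4x4 network about 10.3, consistent with the statement that the 4x4 transit time is never significantly higher for the loads of Table 1.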

[Figure 4: average transit time plotted against p (0.05 to 0.50)
for 2x2 and 4x4 switches at several values of b, where
b = bandwidth (messages/switch cycle).]


Figure 4.   Transit Times for 2X2 and 4X4 Switches.


Memory interleaving, and perhaps hashing of addresses, can
be used to guarantee that memory requests generated at one PE are
uniformly distributed over the MMs. Nonetheless, memory requests
are likely to occur in bursts, which would degrade the
communication network performance. A more accurate evaluation
requires extensive simulation.

Since the number of packets at each queue is exponentially
distributed and the mean size is small, a large number of packets
is extremely unlikely and we expect that using queues of finite
capacity will not significantly degrade system performance.
Preliminary simulations suggest that queues of size 10 are
adequate, but further analyses and simulations will be carried
out.
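
To give a feel for why a small finite queue may suffice, the sketch below computes the chance that a queue holds more packets than its capacity. It rests on assumptions not made in the text: the queue length is taken to be geometric (the discrete analogue of an exponential distribution), and the mean of one packet used in the example is illustrative only. The helper name `overflow_probability` is hypothetical.

```python
def overflow_probability(mean_len: float, capacity: int) -> float:
    """P(queue length > capacity) for a geometric length on 0, 1, 2, ...

    With P(L = j) = (1 - r) * r**j the mean is r / (1 - r),
    so r = mean / (1 + mean) and P(L > c) = r**(c + 1).
    """
    if mean_len < 0 or capacity < 0:
        raise ValueError("mean_len and capacity must be non-negative")
    r = mean_len / (1.0 + mean_len)
    return r ** (capacity + 1)


if __name__ == "__main__":
    # Assumed mean of one packet, queue capacity of ten packets.
    print(overflow_probability(1.0, 10))
```

Under these assumptions the overflow probability falls geometrically with the capacity, which is the qualitative behavior the argument above relies on.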

Simulations  have  indicated  that  network  performance  does  not 
change  significantly  when  switch  transmissions  occur 
asynchronously.  In  these  simulations,  service  time  was  still 
assumed  to  be  discrete,  but  packets  were  generated  at  the  network 
inputs  by  a  Poisson  process.  We  hope  to  advance  our 
understanding  of  asynchronous  networks  by  analysis. 


We  routinely  run  parallel  scientific  codes  under  a 
paracomputer  simulator  to  measure  the  speedup  obtained  by 
parallelism  and  to  judge  the  difficulty  involved  in  creating  a 
parallel  program  (see  Section  5).  We  have  recently  modified  the 
simulator  to  reflect  the  proposed  network  design  rather  than  an 
ideal  paracomputer.  Since  an  accurate  simulation  would  be  very 
expensive,  we  used  instead  a  multi-stage  queuing  system  model, 
with  stochastic  service  time  at  each  stage  (a  complete 
description  is  given  in  [13]).  The  parameters  were  set  to 
correspond  to  a  network  with  six  stages  of  4x4  switches, 
connecting 4096 PEs to 4096 MMs. A message was modeled as one
packet if it did not contain data (e.g. a load travelling to
memory) and as three packets otherwise, and each queue was
limited to fifteen packets. The following table, which
complements Table 1, summarizes simulations of the four
previously mentioned codes. The time unit is the PE instruction
cycle time.


problem  |  extra running  |  average central  |  idle cycles
         |  time due to    |  memory           |  per load from
         |  network delay  |  access time      |  central memory


■60% 
■63% 
■28% 
-24% 


5.3 
4.5 
4.9 
3.5 


Table  2.   Network  Induced  Delays. 


In these simulations, the number of requests to central
memory was comfortably below the maximal number that the network
could support, and indeed the average transit time was not far
from the minimum. Since the CDC compiler we used often prefetches
operands from memory, we obtained a low number of idle cycles,
i.e. cycles in which a PE waits for the arrival of a needed
operand fetched by a previous instruction. For the first two
programs no attempt was made to reduce the number of central
memory accesses, so we expect those statistics to be nearly a
worst case for numerical codes.

We have recently constructed a detailed switch simulator at
the register transfer level, which will enable us to obtain more
accurate network simulations. We shall also pursue a more
precise analytical modeling of the network, and a performance
analysis of caching.

5.0   SIMULATION  AND  SCIENTIFIC  CODE  EXPERIMENTS

As indicated above, we use an instruction-level paracomputer
simulator [2] to study parallel variants of scientific codes.
Applications already studied include radiation transport,
incompressible fluid flow within an elastic boundary, atmospheric
modeling, and Monte Carlo simulation of fluid structure. Current
efforts include extending the simulator to model the connection
network and running codes under a parallel operating system
scheduler.

The goals of our paracomputer simulation studies are, first,
to develop methodologies for writing and debugging parallel codes
and, second, to predict the efficiency that future large-scale
parallel systems can attain. As an example of the approach
taken, and of the results thus far obtained, we report on
experiments with a parallelized variant of the code TRED2 (taken
from Argonne's EISPACK library), which uses Householder's method
to reduce a real symmetric matrix to tridiagonal form (see [5]
for details).


An  analysis  of  the  parallel  variant  of  this  code  shows  that 
the  time  required  to  reduce  an  N  by  N  matrix  using  P  processors 
is  well  approximated  by 

T(P,N) = AN + DN**3/P + W(P,N)

where the first term represents "overhead" code that must be
executed by all PEs (e.g. loop initializations), the second term
represents work that is divided among the PEs, and W(P,N), the
waiting time, is of order max(N, P**.5). We determined the
constants  experimentally  by  simulating  TRED2  for  several  (P,N) 
pairs  and  measuring  both  the  total  time  T  and  the  waiting  time  W. 
(Subsequent  runs  with  other  (P,N)  pairs  have  always  yielded 
results  within  1%  of  the  predicted  value.)  Table  3  summarizes 
our  experimental  results  and  supplies  predictions  for  problems 
and  machines  too  large  to  simulate  (these  values  appear  with  an 
asterisk).  In  examining  this  table,  recall  that  the  efficiency 
of  a  parallel  computation  is  defined  as 

E(P,N)  =  T(1,N)/(P*T(P,N))  . 


        Reduction of Matrices to Tridiagonal Form

         |                     #PE
     N   |     16      64     256    1024    4096
 --------+------------------------------------------
     16  |    62%     26%      7%     1%*     0%*
     32  |    87%     60%     25%     6%*     1%*
     64  |    96%     86%     59%    27%*     7%*
    128  |    99%*    96%*    86%*   59%*    24%*
    256  |   100%*    99%*    96%*   86%*    58%*
    512  |   100%*   100%*    99%*   96%*    85%*
   1024  |   100%*   100%*   100%*   99%*    96%*

Table 3.   Measured and Projected Efficiencies
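
The timing model and the efficiency definition can be exercised with a few lines of code. The sketch below is a hedged illustration: the text states only that W(P,N) is of order max(N, P**.5), so the form w*max(N, sqrt(P)) is an assumption, and the constants a, d, and w are placeholders rather than the experimentally determined values. Its output therefore does not reproduce Table 3; it only shows how the formulas combine.

```python
import math


def run_time(p: int, n: int, a: float, d: float, w: float) -> float:
    """T(P,N) = A*N + D*N**3/P + W(P,N).

    W(P,N) is modeled as w * max(N, sqrt(P)), an assumed form that is
    merely consistent with the stated order of growth.
    """
    return a * n + d * n ** 3 / p + w * max(n, math.sqrt(p))


def efficiency(p: int, n: int, a: float, d: float, w: float) -> float:
    """E(P,N) = T(1,N) / (P * T(P,N))."""
    return run_time(1, n, a, d, w) / (p * run_time(p, n, a, d, w))


if __name__ == "__main__":
    a, d, w = 10.0, 1.0, 5.0  # placeholder constants, not fitted values
    for n in (16, 64, 256):
        row = [efficiency(p, n, a, d, w) for p in (16, 64, 256)]
        print(n, " ".join(f"{e:6.1%}" for e in row))
```

Even with placeholder constants, the model shows the same trend as the table: for a fixed N efficiency drops as P grows, and for a fixed P it rises with N.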


Although we consider these measured efficiencies
encouraging, we note that system performance can probably be
improved even more by sharing PEs among multiple tasks.
(Currently the simulated PEs perform no useful work while
waiting.) If we make the optimistic assumption that all the
waiting time can be recovered, the efficiencies rise to the
values given in Table 4.


        Reduction of Matrices to Tridiagonal Form
                 (without waiting time)

         |                     #PE
     N   |     16      64     256    1024    4096
 --------+------------------------------------------
     16  |    71%     37%     12%      3%      0%
     32  |    90%     69%     35%     12%      3%
     64  |    97%     90%     68%     35%     12%
    128  |    99%     97%     90%     68%     35%
    256  |   100%     99%     97%     90%     68%
    512  |   100%    100%     99%     97%     90%
   1024  |   100%    100%    100%     99%     97%

Table 4.   Projected Efficiencies
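
As a hedged aside, the shape of Table 4 can be read off the timing model directly. Assuming that T(1,N) is given by the same formula as T(P,N), and setting the waiting time W to zero (both are assumptions made here for illustration), the efficiency simplifies to:

```latex
E(P,N) \;=\; \frac{T(1,N)}{P\,T(P,N)}
       \;=\; \frac{AN + DN^{3}}{P\left(AN + DN^{3}/P\right)}
       \;=\; \frac{A + DN^{2}}{PA + DN^{2}}
```

Under these assumptions efficiency stays close to 1 as long as PA is small compared with DN^2, so quadrupling P is roughly offset by doubling N. This agrees qualitatively with the nearly constant values along the diagonals of Table 4.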

6.0   CONCLUSION

Our  simulations  have  conclusively  shown  that  a  paracomputer 
containing  thousands  of  processors  would  be  an  extremely  powerful 
computing  engine  for  large  scientific  codes.  But  such  ideal 
machines  cannot  be  built.  In  this  report  we  have  described  a 
realizable  approximation,  the  ultracomputer;  we  believe  that, 
within  the  decade,  a  4096  PE  ultracomputer  can  be  constructed 
with  roughly  the  same  component  count  as  found  in  today's  large 
machines.  Although  our  ultracomputer  simulations  are  still 
fragmentary,  the  preliminary  results  thus  far  obtained  are 
encouraging.


To  demonstrate  further  the  feasibility  of  the  hardware  and 
software  design,  we  plan  to  construct  a  64  PE  prototype  using  the 
switches  and  interfaces  described  above  to  connect  commercial 
microprocessors  and  memories. 